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ABSTRACT 

We present a test of different error estimators for 2-point clustering statistics, ap- 
propriate for present and future large galaxy redshift surveys. Using an ensemble of 
very large dark matter ACDM N-body simulations, we compare internal error es- 
timators (jackknife and bootstrap) to external ones (Monte-Carlo realizations). For 
3-dimensional clustering statistics, we find that none of the internal error methods 
investigated are able to reproduce neither accurately nor robustly the errors of exter- 
nal estimators on 1 to 25/i~^Mpc scales. The standard bootstrap overestimates the 
variance of ^(s) by ~ 40% on all scales probed, but recovers, in a robust fashion, the 
principal eigenvectors of the underlying covariance matrix. The jackknife returns the 
correct variance on large scales, but significantly overestimates it on smaller scales. 
This scale dependence in the jackknife affects the recovered eigenvectors, which tend 
to disagree on small scales with the external estimates. Our results have important 
implications for the use of galaxy clustering in placing constraints on cosmological 
parameters. For example, in a 2-parameter fit to the projected correlation function, 
we find that the standard bootstrap systematically overestimates the 95 % confidence 
interval, while the jackknife method remains biased, but to a lesser extent. The scatter 
we find between realizations, for Gaussian statistics, implies that a 2-a confidence in- 
terval, as inferred from an internal estimator, could correspond in practice to anything 
from 1-cr to 3-(t. Finally, by an oversampling of sub-volumes, it is possible to obtain 
bootstrap variances and confidence intervals that agree with external error estimates, 
but it is not clear if this prescription will work for a general case. 

Key words: galaxies: statistics, cosmology: theory, large-scale structure. 



1 INTRODUCTION 

The large-scale structure (LSS) of the Universe, as traced by 
galaxies and clusters, encodes important information about 
the basic cosmological parameters and also the physical pro- 
cess which underpin the formation of cosmic structures. LSS 
measurements are now routinely used in conjunction with 
measurements of the cosmic microwave background (CMB) 
to constrain cosmological parameters (e.g. Cole et al. 2005; 
Sanchez et al. 2006; Tegmark et al. 2006; Percival et al. 
2007). However, the estimation of errors in CMB measure- 
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ments is, by comparison with LSS work, quite sophisticated 
and rigorous. 

To constrain the cosmological world model from the 
galaxy distribution, large volumes and galaxy numbers are 
needed. The current generation of local galaxy surveys, such 
as the 2dFGRS (CoUess et al. 2001) and SDSS (York et al. 
2000) satisfy both of these requirements and can be divided 
into large sub-samples to study clustering trends with galaxy 
properties (e.g. Norberg et al. 2001, 2002; Zehavi et al. 2002, 
2004, 2005; Madgwick et al.2003; Croton et al. 2004, 2007; 
Gaztanaga et al. 2005; Li et al. 2006, 2007). In anticipation of 
even bigger surveys at intermediate and high redshifts (e.g. 
GAMA, VVDS-CFHTLS, Euclid), it is timely to revisit the 
techniques used to estimate errors on clustering statistics 
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from galaxy redshift surveys. In particular, we explore ways 
to robustly quantify reliable errors estimates on two-point 
clustering statistics in 3-dimensional space. This has impor- 
tant consequences for the determination of the cosmological 
parameters, both in terms of setting values and in the num- 
ber of free parameters needed to describe the data, and in 
uncovering trends which can be used to constrain galaxy 
formation models. 

In a perfect world, the optimal way to estimate the er- 
ror on a clustering measurement would be to generate a 
large sample of independent mock galaxy catalogues that 
look the same as the data (at least in so far as the ob- 
served data may be affected by sampling fluctuations, e.g. 
Efstathiou & Moody 2001). The challenge here is to ensure 
that the mocks are a faithful reproduction of the observa- 
tions. This is a more demanding requirement than one might 
at first think. In order to estimate accurate errors on the 
two-point statistics, it is necessary that the mocks also re- 
produce the higher order clustering displayed by the data, 
since the errors on 2-point statistics implicitly depend on 
these higher order moments. This approach will inevitably 
involve N-body simulations, in order to accurately model 
the underlying dark matter, and so becomes computation- 
ally expensive. Various techniques have been adopted in the 
literature to populate such simulations with galaxies (for 
a survey see Baugh 2008). The number of simulations re- 
quired is large, running into several tens to get reasonable 
estimates of the variance and hundreds or even thousands 
to get accurate covariance matrices. 

An alternative empirical method is to use the observed 
data itself to make an estimate of the error on the mea- 
surement. Such "internal" error estimates use a prescrip- 
tion to perturb the data set in some way in order to make 
copies. These copies allow the statistical distribution which 
underlies the data to be probed without having to assume 
anything about its form. In this paper we investigate the 
performance of two common approaches, the jackknife and 
bootstrap internal error estimates; both will be compared to 
the errors calculated from a large suite of simulated mock 
data sets. 

The jackknife method was developed as a quite generic 
statistical tool in the 1950s (Quenouille 1956; Tukey 1958). 
The bootstrap method (e.g. Efron 1979) is a modification 
or extension of the jackknife made possible by the availabil- 
ity of fast computers in the 1970s. These internal estimators 
first divide the data set into subsamples, which consist ei- 
ther of individual objects or groups of objects, which are 
then resampled in a particular way (see Section 2 for a full 
description of the procedure). The first applications of ei- 
ther technique to astronomical observations date from the 
early 1980s with the analysis of the velocities of galaxies in 
clusters (Lucey 1979; Bothun et al. 1983). The bootstrap 
method was first applied to galaxy clustering by Barrow, 
Bhavsar & Sonoda (1984; see also Ling, Frenk & Barrow 
1986). 

In these early works sub-samples were simply taken us- 
ing the individual objects themselves. However, as pointed 
out by Fisher et al. (1994), this can lead to unreliable errors. 
From a Monte-Carlo analysis using a series of N-body simu- 
lations. Fisher et al. showed that when sampling individual 
galaxies the bootstrap method underestimates the error on 
the density estimate in voids and, likewise, overestimates it 



in clusters. Mo, Jing & Borner (1992) present an analytic 
estimate for the errors on 2-point statistics and show that 
the bootstrap errors obtained by resampling using individ- 
ual galaxies give incorrect errors compared with the ensem- 
ble average under certain conditions. These valid objections 
can be avoided by resampling the data set divided into sub- 
volumes instead of individual galaxies. Hence, in this paper, 
we generate copies of data sets by selecting sub-volumes of 
the data set. 

Resampling of patches or regions of surveys has already 
been applied to error estimation in LSS analyses. The jack- 
knife technique has been extensively used in projection, in- 
cluding the angular clustering of galaxies (Scranton et al. 

2002) , in the analysis of CMB maps (e.g. Gaztanaga et al. 

2003) and, perhaps most heavily, in efi'orts to detect the In- 
tegrated Sachs- Wolfe effect by cross-correlating sky maps of 
galaxy or cluster positions with temperature fluctuations in 
the cosmic microwave background (Fosalba, Gaztafiaga & 
Castander 2003; Fosalba & Gaztanaga 2004; Cabre et al. 
2007). In projection, the jackknife variance agrees well with 
Monte-Carlo estimates from mock catalogues (e.g. Cabre 
et al. 2007). This is perhaps an easier case than the three 
dimensional surveys dealt with in this paper, since the dis- 
tributions of galaxy fluctuations will tend to look more 
Gaussian in projection (although distinct non-Gaussian be- 
haviour is clearly evident in the projected galaxy fluctua- 
tions, e.g. Gaztanaga 1994). 

The jackknife method has also been applied by many 
authors to the estimation of errors on galaxy clustering in 
three dimensions using volume resampling. Zehavi et al. 
(2002) carried out a simple test of the accuracy of the jack- 
knife estimate of the variance and found that the jackknife 
produced an essentially unbiased estimate, but with a scat- 
ter that could approach 50%. Here we carry out a more 
exhaustive comparison of internal error estimators with the 
"external" Monte-Carlo simulation methods. 

The paper is arranged in the following way. Section [5] 
defines the three error estimators we use throughout, while 
Section |3] introduces the simulations which will be used to 
generate fake data sets for our numerical experiments, along 
with the clustering statistics and the principal component 
decomposition, needed for the error estimator comparison. 
Section 2] presents the raw results from three different er- 
ror techniques and remains rather technical. We consider in 
Section [5] a simple test case scenario, where the implications 
of using different error estimators for a "straightforward" 
two parameter fit to the projected correlation function are 
inferred. We summarize our findings in Section |6] 



2 REVIEW OF ERROR ESTIMATION 
METHODS 

There are numerous ways to estimate the error on a clus- 
tering measurement. In this section, we give an outline of 
four of the most popular non-parametric methods in use 
in the literature, three of which we compare in this paper. 
We do not consider in this paper analytic error estimates 
(like Poisson), nor parametric error estimates, like those de- 
rived from Gaussian or Log-Normal density fields. The latter 
methods are commonly used for estimating errors in the lin- 
ear clustering regime (see e.g. Percival et al. 2001 and Cole 
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et al. 2005 for applications to power spectrum estimates). 
Below, we describe three "internal" methods which use the 
data itself to derive an estimate of the error on a measure- 
ment, the sub-sample, jackknife and bootstrap techniques 
(sections 12.21 12.31 and I2.4|l . Then we describe in section 12.51 
the commonly used "external" method of creating Monte- 
Carlo realizations or reproductions of the data. We close by 
giving an overview of the advantages and disadvantages of 
each method in Section [2.61 

Each of the techniques involves performing measure- 
ments on copies of the data in order to sample the under- 
lying probability distribution of the quantity we are trying 
to measure. The internal estimates make copies or resam- 
plings from the observed data whereas the external method 
generates fake data sets, without manipulating the observed 
data. Three assumptions are implicit in the internal or "data 
inferred" approaches: 1) the data gives an accurate represen- 
tation of the underlying probability distribution of measure- 
ments; 2) the sub-samples into which the data is split are 
sufficient in number to allow accurate enough estimates of 
the errors and 3) the sub-sample volumes are large enough 
to be representative. Condition 1 would be violated if a sec- 
ond independent measurement from an equivalent data set 
gave a significantly different estimate of the measurement. 
In such a case, this would mean that the original data set 
is subject to sampling fluctuations (sometimes called cosmic 
variance) which affect the clustering measurements. Condi- 
tions 2 and 3 are related. The number of sub-samples that 
should be used in the internal estimators depends on the 
questions to be answered and the (unknown) form of the 
underlying probability distribution: for example, to obtain 
an estimate of the variance accurate to 10% would require 
50 resamplings of the data set for a Gaussian distribution. 
At the same time, we need to be careful not to make the sub- 
samples so small so that they become strongly correlated. 
Norberg & Porciani (in prep.) investigate this problem using 
the two-pair correlation function. Later we address the ques- 
tion of whether or not one can put qualitative constraints, 
and maybe even quantitative constraints (using a large suite 
of simulations), on the representativeness of a given size of 
sub-sample. 

2.1 Internal estimates: preliminaries 

The first applications of non-parametric internal techniques 
to the estimation of the error on the two-point correlation 
function considered the removal of individual objects from 
the data set (Barrow, Bhavsar & Sonoda 1984; Ling, Frenk & 
Barrow 1986). This is computationally infeasible for modern 
galaxy catalogues, and it has been shown at galaxy num- 
ber densities currently considered that this type of error 
estimate strongly underestimates the true clustering uncer- 
tainty (e.g. Mo, Jing & Borner 1992; Fisher et al. 1994). 
Furthermore, we want to reduce the correlation between the 
sub-samples into which the data set is split. For this reason, 
we divide the samples by volume and split our data sets 
into Ng^i, cubes of equal volume^. The internal methods 

^ We defer to another paper the discussion of the dependence of 
the results on the exact shape of the sub-volumes. For now, we 
simply note that Cabre et al. (2007) found for angular clustering 



then reconstruct copies of the original data sets, choosing or 
weighting the Nsub sub-volumes in different ways. For each 
copy or resampling of the data, we make an estimate of the 
correlation function, which we denote by Xi in this section. 
Here the subscript i refers to the bin of spatial separation 
and the superscript k refers to the resampling of the data 
for which the correlation function is measured. Note that 
we use the terms "a resampling" and "copy" of the data set 
interchangeably; here, a resampling refers to a "full" copy of 
a data set, rather than to the act of selecting an individual 
sub- volume. 

2.2 Internal estimate 1: The sub-sample method 

The covariance matrix of A'^ independent realizations is, by 
definition, given by 

C{X^,XJ) = —^{Xi~X^){x'^j-Xj), (1) 
fc = l 

where it is assumed that the mean expectation value, Xi, 
is not estimated from the {Xi}^^i samples, but from an 
independent realization of the data. 

Hence, the simplest error method, commonly referred 
to as the sub-sample method, consists of splitting the data 
set into Ng^b independent samples and estimating the co- 
variance matrix using Eq. [1] where the clustering statistic 
is estimated for each one of the sub-samples separately. For 
Nsub independent sub-samples, this returns the correct co- 
variance for a sample of volume I /Nsub of the original vol- 
ume, implying that the covariance of the full dataset is ac- 
tually Nsub times smaller (as the variance scales directly 
with the volume considered). This method has been used in 
several studies, in particular where the survey volumes con- 
sidered are large (e.g. Maddox et al. 1990, Hamilton 1993a 
and Fisher et al. 1994 for the APM, the IRAS 2Jy and the 
IRAS 1.2Jy galaxy surveys respectively). 

However, one basic assumption made in this approach is 
never really satisfied for galaxy clustering studies in the Uni- 
verse: the sub-samples are never fully independent of each 
other, irrespective of the clustering scales considered. This is 
due to the presence of long-range modes in the density fiuc- 
tuations, making all sub-samples to some extent correlated 
with each other. This can be related to the fact that the 
correlation function has a small but non-zero value on large 
scales. Therefore there is a need to consider alternative in- 
ternal estimators, accounting hopefully for these limitations. 
Hereafter we will not consider the sub-sample method, even 
though it has been extensively used in the past. 

2.3 Internal estimate 2: The jackknife method 

We consider the "delete one jackknife" method (Shao 1986). 
A copy of the data is defined by systematically omitting, 
in turn, each of the Nsub sub-volumes into which the data 
set has been split. The resampling of the data set con- 
sists of the Nsub — 1 remaining sub-volumes, with volume 
{Nsub — 1)/Nsub times the volume of the original data set. 

studies that "irregular" shapes can jeopardize the internal error 
methods, particularly the jackknife. 
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The clustering measurement is repeated on the copy or re- 
sampling of the original data set. By construction, there 
are only N =Nsub different copies of the data set that are 
created in this way. The covariance matrix for A*' jackknife 
resamplings is then estimated using 

(N 11 ^ 

Cj),{Xi,Xj) = — ^{x'l ~ X^){x) - Xj) , (2) 

fc = l 

where Xi is the f"^ measure of the statistic of interest (out 
of A'^ total measures), and it is assumed that the mean ex- 
pectation value is given by 

JV 

X, = ^xf/iV. (3) 
fc=i 

Note the factor of A'^— 1 which appears in Eq.[2](Tukey 1958; 
Miller 1974). Qualitatively, this factor takes into account the 
lack of independence between the A'^ copies or resamplings 
of the data; recall that from one copy to the next, only two 
sub- volumes are different (or equivalently N — 2 sub- volumes 
are the same). Hereafter, we will refer to a jackknife estimate 
from Neub sub-samples as Jack-A^sut. 

2.4 Internal estimate 3: The bootstrap method 

A standard bootstrap resampling of the data set is made 
by selecting Nr sub-volumes at random, with replacement, 
from the original set (Efron 1979). Effectively, a new weight 
is generated for each sub- volume. In the original data set, all 
sub- volumes have equal weight. In a resampled data set, the 
weight is simply the number of times the sub-volume has 
been selected e.g. Wi = 0,1,2.... The clustering measure- 
ment is repeated for each resampled data set. For a given 
Nr, the mean fractional effective volum^EI of the resampled 
data sets tends to a fixed fraction of the original sample vol- 
ume. For Nr=Nsub, the mean effective volume is less than 
the volume of each of the jackknife resamples. We further 
develop this discussion in H4.2l and H5.5I 

Unlike for jackknife error estimates, in principle there 
is no limit on the number A'^ of resamplings or copies of 
the data for bootstrap error estimates. In practice, the vari- 
ance on a measurement converges relatively slowly with 
increasing numbers of trials (Efron & Tibshirani 1993). 
Furthermore, for our application, resamplings are cheap 
to generate but expensive to analyse. For a data set di- 
vided into Nsub sub-volumes and from which one draws 
Nr sub- volumes at random with replacement, there are 
{Naub + Nr - iy./{Naub - l)!Afr! different possible boot- 
strap resamplings, which even in the modest example of 
Nr—Nsub — W corresponds to 92 378 different bootstrap re- 
samplings. Here we restrict ourselves to of the order of one 
hundred resamplings (i.e. A'^ ~ 100). Until now, most, if not 
all, bootstrap clustering estimates have used resamplings 
consisting of Nr—Nsub sub-volumes. In this paper we test 
this assumption by considering up to 4 Nsub sub- volumes to 
construct each resampling of the data set (Section 15. 5p . 



^ The fractional effective volume of a resampled data set is given 
by the ratio between the number of unique sub- volumes selected 
and the total number of sub-volumes the data set is split into. 



The covariance matrix for TV bootstrap resamplings of 
the data is given by 

Cbootix^,Xj) = J^—Y'^{x'^-Xi){Xj-Xj), (4) 
fc = l 

where it is assumed that the mean expectation value is given 
by Eq.|3l Note that there is no A'^ — 1 factor in the numerator 
of this expression, as was the case for the jackknife. Quali- 
tatively the data set copies are thought of as being "more" 
independent in the case of bootstrap resampling than for 
jackknife resampling, something we address in detail in Sec- 
tions 13] and O In what follows we will refer to the mean 
bootstrap estimate from Nsub sub-samples as Boot-A^sut- 

2.5 External estimate: Monte Carlo realizations 

The Monte Carlo method consists of creating A'^ statistically 
equivalent versions of the data set being analysed, on each 
of which the full analysis is repeated. Technically, bootstrap 
resampling is also a Monte-Carlo method. However, the dis- 
tinction here is that what we have termed the Monte Carlo 
approach makes no explicit reference to the observed data 
set to generate the synthetic data sets. We are not resam- 
pling the data in any way. Instead, we assume that we know 
the underlying statistical or physical processes which shaped 
the observed data set and feed these into a computer simula- 
tion. Here we have run N-body simulations which model the 
clustering evolution of the dark matter in the universe (see 
Section [3. ip . The Monte Carlo element in this case refers to 
the initial conditions, which are drawn from a distribution of 
density fluctuations consistent with cosmological constraints 
(see e.g. Sanchez et al. 2006). The level of realism of the com- 
puter simulation determines the cost of this method. For a 
clustering analysis of an observed galaxy catalogue, the de- 
mands on the Monte Carlo approach are even greater as the 
N-body simulations need to be populated with galaxies, ac- 
cording to some prescription, with the goal of statistically 
reproducing the galaxy sample as faithfully as possible (see 
e.g. Baugh 2008 for a review of the techniques used to build 
synthetic galaxy catalogues). 

We hereafter refer to the Monte Carlo catalogues gener- 
ated in the external error estimation as "mocks" , and their 
associated errors as mock errors. In the current work using 
N-body simulations, we use Eq. [J] as the definition of the 
Monte Carlo covariance matrix, since we define our "refer- 
ence sample" as the mean measurement of the correlation 
function extracted from the ensemble of simulations. 

2.6 Pros and cons of each method 

Each of the error estimation techniques described above has 
its advantages and disadvantages. By definition, errors cal- 
culated directly from the data take into account any hidden 
or unforeseen systematics and biases that might otherwise be 
missed. This is particularly important for clustering statis- 
tics, where the errors on the 2-point correlation function also 
depend on the higher order clustering of the data. These 
properties of the galaxy distribution are usually not other- 
wise appropriately accounted for in an error analysis. Other 
properties of the data, such as the galaxy mix and survey se- 
lection, are also naturally satisfied in an internal approach. 
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In the case of external error estimation using mocks, only the 
statistics that have been deliberately included in the Monte 
Carlo realization are guaranteed to be taken into account in 
the error analysis. If a biasing scheme has been constrained 
to reproduce the two-point function of galaxy clustering, 
there is no guarantee that the higher order moments of the 
distribution will also match those of the observed data set. 

On the other hand, internal error estimates are often 
severely limited by the size of the data set itself. This can 
be particularly problematic for clustering statistics, as stud- 
ied here. Obviously, sampling or cosmic variance on scales 
larger than sample volume cannot be addressed by internal 
estimates, but can be included in external estimates made 
using mocks. 



3 THE NUMERICAL EXPERIMENTS 

Our aim in this paper is to repeat the early comparisons 
of internal and external error estimates in the context of 
current and forthcoming galaxy redshift surveys. We draw 
data sets from numerical simulations which are described 
in Section 13.11 The clustering analysis of these data sets 
is described in Section 13.21 Finally, in Section 13.31 we give 
an outline of the principal component decomposition of the 
covariance matrix of clustering measurements, which is an 
invaluable diagnostic in error estimation. 



3.1 Creating test data sets 

Our analysis uses the z = 0.5 output from the L-BASICC 
ensemble of N-body simulations carried out by Angulo et al. 
(20080. The set comprises 50 moderate resolution runs, each 
representing the dark matter using 448^ = 89 915 392 parti- 
cles of mass 1.85x10^^ h'^ Mq inaboxof side 1340/i~^Mpc. 
Each L-BASICC run was evolved from a different realization 
of a Gaussian density field set up at z = 63. The adopted 
cosmological parameters are broadly consistent with recent 
data from the cosmic microwave background and the power 
spectrum of galaxy clustering (e.g. Sanchez et al. 2006): 

= 0.25, = 0.75, (78 = 0.9, n = 1, and = -1. 
Throughout we assume a value for the Hubble parameter of 
h = i/o/(100kms-^Mpc-^) = 0.73. 

The combination of a large number of independent re- 
alizations and the huge simulation volume make the L- 
BASICC ensemble ideal for our purposes. Angulo et al. 
(2008) showed that nonlinear and dynamical effects are still 
important for the evolution of matter fluctuations even on 
scales in excess of 100/i~^Mpc and that these scales con- 
tribute to the growth of smaller scale perturbations. A 
smaller simulation volume would not be able to model the 
growth of fluctuations accurately because these long wave- 
length modes would be missed. The large volume also means 
that it is still possible to extract more than one region from 
each simulation which can be considered as being effectively 
independent from the others due to their large spatial sep- 
aration. Equally important it is also possible to mimic with 
such a large simulation volume the real situation of a survey 



^ Due to an unfortunate labelling of the simulations outputs, we 
did not use the redshift zero outputs as initially intended. 



Table 1. A summary of the numerical experiments conducted 
in this paper. Col. 1 lists the error estimation technique applied; 
col. 2 gives the number of sub-samples into which each data set 
is split up; col. 3 indicates whether the analysis was performed 
in real (r) or redshift (z) space; col. 4 gives the number of differ- 
ent data sets used; col. 5 shows the number of resamplings and 
clustering measurements performed for each data set, N; col. 6 
lists, for the case of bootstrap errors only, the relative number of 
sub-volumes selected at random with replacement from the orig- 
inal list w.r.t. A^suj,; col. 7 shows the sampling fraction w.r.t. our 
nominal mean density which is set to match that of a L* galaxy 
sample. The first group of experiments yielded our main results 
and used in the majority of the plots; the other experiments are 
variants to test different components of the analysis and referred 
to in the text. 
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like the 2dFGRS, for which there are two distinct survey 
regions in two nearly opposite directions on the sky. 

From each of the 50 simulation cubes we extract two 
cubical sub- volumes of 380 Mpc on a side which are sep- 
arated by at least ~ 500 Mpc from one other. We could 
have extracted more than 40 volumes of this size from each 
simulation cube without choosing the same volume twice. 
However, we chose to follow this more conservative approach 
as we want to be able to treat the two sub-volumes as be- 
ing effectively independent of one another. Although they 
come from the same simulation volume, the fluctuations on 
scales greater than 500 Mpc are relatively weak. Hence, 
the total number of available data sets constructed in this 
way comes to 100, each of which is fully independent of 
98 of the others and essentially independent of the remain- 
ing 99"^ data set. The size of each data set is chosen to 
match the volume of a typical L* volume-limited sample 
extracted from the main SDSS galaxy redshift survey (e.g. 
Zehavi et al. 2004, 2005). Note that this volume is about 
10 times larger than the equivalent volume-limited samples 
of L* galaxies used in the flnal 2dFGRS clustering analyses 
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(Norberg et al. in prep.). Finally, we have randomly diluted 
the number of dark matter particles in each data set in or- 
der to match the number density of a L* galaxy sample of 
3.7 X 10"^ /i"^ Mpc^, which mimics the discreteness or shot 
noise level in typical observational data sets. 

Note that on this occasion, we do not attempt to model 
a particular galaxy sample in detail, as our aim here is to 
provide a generic analysis that is as widely applicable and re- 
producible as possible, within a framework which is likely to 
remain the same in the foreseeable future. Moreover, if one 
attempted to model a galaxy sample rather than the dark 
matter, this would open up the issue of how well can the 
model reproduce the higher order clustering of the galax- 
ies. This is a problem to be addressed by those building 
mock galaxy catalogues for particular surveys and is beyond 
the scope of the current paper. By focusing on dark matter 
clustering only, our analysis remains well defined and fully 
accurate within our chosen cosmological model. 

A summary of the samples and error calculations car- 
ried out is given in Table [1] Each data set drawn from a 
simulation can be divided into sub-volumes in order that it 
be resampled according to the jackknife or bootstrap algo- 
rithm. Thus, we have up to 100 independent experiments for 
which we can compare internal estimates of the error on the 
correlation function, and consider for the first time in detail 
the whole error distribution of internal error estimates. The 
external "mock" estimate of the error is obtained using all 
100 data sets. In all, we have performed about 250 thou- 
sand correlation function estimates, representing about one 
million CPU hours of calculations. 

3.2 Estimating the two-point correlation function 

Throughout this paper we consider only standard two-point 
clustering statistics and we delay to a future paper the study 
of more recent two-point clustering statistics, like the ua- 
statistic of Padmanabhan, White & Eisenstein (2007). We 
measure the two point correlation function, ^x('"p,7r), as a 
function of pair separation perpendicular to (rp) and paral- 
lel to (tt) the line of sight of a fiducial observer, in real (X=r) 
and in redshift space (X=s). In order to remain as compa- 
rable as possible with the analysis of observational data, we 
do not use the common distant observer approximation, but 
instead perform a realistic angular treatment with a location 
chosen for the observer within the simulation box, i.e.: 

rios = (rl + ri)/2 

= ri ■ rl^ + r2 ■ fl^ 
TT = v/(ri - r2)2 - (rp)2 , (5) 

where ri and are the position vectors of the two objects in 
the pair and a hat indicates that the vector is appropriately 
normalized; rios is the mean distance to the pair of objects 
along the line of sight from the observer. Redshift space 
distortions are modelled in the same way, i.e.: 

-ztot = 2hub + V • r /c , (6) 

where Zhub is the galaxy's redshift without peculiar motions, 
V is its peculiar velocity in addition to the Hubble fiow and 
c is the speed of light. 

We calculate the two dimensional correlation function 
^x('"p,7r) for each data set using standard estimators, e.g. 



Hamilton (1993) and Landy & Szalay (1993). In order to 
obtain a robust estimate of the mean density, these estima- 
tors require that a large number of points be randomly dis- 
tributed within the boundaries of the data set, according to 
the angular and radial selection functions which define the 
data. The number of randomly distributed points is typi- 
cally larger than the number of data points by a factor of 
/ ~ 100, and we use the technique outlined in Landy & Sza- 
lay (1993) to speed up the calculation of the random-random 
pair counts. In practice, to avoid fluctuations in the mean 
density at small pair separations, we typically use 4 times as 
many random points as data points and repeat the estimate 
of the pair counts about 25 times. We use a unique set of 
randoms for each calculation, so that our statistics are not 
influenced by any features in the randoms, which when an 
experiment is repeated numerous times could actually show 
up in a systematic fashion. Finally to compute the 5s(rp,7r) 
estimate, we limit the counts to pairs separated by less than 
10 degrees for two reasons: 1) results effectively estimated 
in redshift space can be properly interpreted using the red- 
shift space distortion model of Kaiser (1987), developed in 
the distant observer approximation; 2) the projected corre- 
lation function, defined below, can be considered as a real 
space quantity. These issues are discussed in greater detail 
in e.g. Matsubara (2000) and Szapudi (2004), but also ap- 
plied in some clustering studies, like Hawkins et al. (2003) 
and Cabre & Gaztanaga (2008). Using logarithmic binning 
in both the rp and 7r-directions, we estimate the projected 
correlation function, Wp(rp)/rp (sometimes also written as 
H((T)/a and with a = rp), by integrating ^x(rp,7r) in the 
TT direction (see Eq. [7] below). The only real motivation for 
this exercise is that the projected correlation function is, in 
theory, free from distortions arising from gravitationally in- 
duced peculiar motions. This point is discussed further in 
Section |4T] 

We also estimate the spherically averaged correlation 
functions, £,r{s) and fs(s) in "r" real and "s" redshift space 
respectively, by averaging over shells in pair separation given 
by s = \/rp + ■k'^. As for the projected correlation function 
we use logarithmic binning and accumulate pair counts di- 
rectly, i.e. not by integrating over the 2-d correlation func- 
tion estimate. 

Given ^x(rp,7r), the different clustering statistics are 
related by 

Wp{rp)/rp = — ^x(rp,7r)d7r . (7) 

''p Jo 

Cx(s) = < Cx(rp,7r) > (8) 

with X=r or s, real and redshift space respectively. Here, the 
integral for Wp(rp)/rp is carried out to a maximum value of 
the separation along the line of sight of vTmax. Theoretically 
Wp(rp) is only a true real space quantity when iVmax = oo. 
In Section 14.11 we discuss the systematic implications of a 
finite choice for TTmax. 



3.3 Principal component decomposition 

The covariance matrix of correlation function measure- 
ments, is not, in itself, of much interest. However, the inverse 
of the covariance matrix plays a pivotal role, appearing, for 
example, in all simple estimates. As matrix inversions 
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are highly non-hnear by nature, the impact of noise in the 
estimate of the covariance matrix is hard to judge, unless ex- 
tensive tests are carried out. A number of procedures exist 
to help with the matrix inversion and analysis. 

First, we prewhiten the covariance matrix, which means 
that all diagonal terms are rescaled to unity and all non- 
diagonal terms are rescaled to fall between -1 and 1, 

C = a Cn (7 , (9) 

where Cn is the normalized covariance matrix, and ai,j = 
Ci^iSij, hereafter also referred to as ai. 

Second, we decompose the Nu x A''„ normalized covari- 
ance matrix into its principal components, solving the eigen- 
equations: 

CnE; = A,Ei i = l,...,Nu, (10) 

where Ai and Ei are the normalized eigenvalues and eigen- 
vectors of the normalized covariance matrix. We have delib- 
erately defined Nu as the dimension of the covariance matrix 
actually used in the analysis, and not assumed it to be nec- 
essarily equal to the total number of data points for the 
given statistic (e.g. the number of bins of pair separation in 
the correlation function). Even though the parts of the co- 
variance matrix that are used do not depend on those that 
are not used, principal component decomposition and ma- 
trix inversion strongly rely on the full matrix set up. For 
that reason, it is essential to pre-define the parts of the co- 
variance matrix to be considered, a fact that is most often 
neglected. 

There are many useful properties of principal com- 
ponent decomposition worth listing here (see for example 
Kendall 1975; Press et al. 1992): 

• The {Ei}[^-^ form a complete orthogonal basis, from 
which the whole covariance matrix can be reconstructed. 

• A principal component decomposition minimises the er- 
ror for the number of components used. 

• A ranking of the eigenvectors in order of decreasing 
eigenvalue highlights the principal trends of the covariance 
matrix. 

• For a given covariance matrix there is a unique set 
of principal components. Comparing covariance matrices is 
equivalent to the comparison of principal components. 

With respect to this last point, it is essential to understand 
noise in the statistic and its non-trivial effect on estimating 
the principal components of the covariance matrix. Clearly, 
in the absence of noise, two covariance matrices are identical 
if, and only if, their principal components are identical. In 
real systems that contain noise this is no longer true. 

For this paper the most important use of the principal 
component decomposition technique is to redefine the of 
a model given the data: 

= (xd - xth)^ C"^ (xd - xth) (11) 

yd = E^a-'xd, (13) 

with a defined by Eq. [9] and where E is the rotation matrix 
composed of the unique eigenvectors Ei previously defined 
in Eg. 1101 The beauty of Eq. [T2]is that, when summed over 
all Nu terms, it is exactly equivalent to the standard 



given in Eq. 1111 Additionally, the properties of the principal 
component decomposition ensure it yields the most efficient 
representation of the covariance matrix if we truncate it to 
include fewer than Nu modes. Ifence, hereafter, the number 
of principal components considered is given by Npca, and 
only when Npca = Nu are we considering a full covariance 
matrix analysis. 



4 ANALYSIS AND RESULTS 

In this section we present our main results, carrying out a 
systematic comparison of applying different error estimation 
techniques to data sets which are representative of current 
and forthcoming galaxy surveys. The numerical experiments 
we have carried out are summarised in Table [1] To recap, we 
have created 100 test data sets by extracting independent 
volumes from an ensemble of N-body simulations (JJSTT}. In 
our fiducial case, the density of dark matter particles in the 
test data sets has been diluted to match the abundance of 
L* galaxies. Each data set can be divided into equal sized 
cubical sub- volumes and resampled according to the jack- 
knife and bootstrap algorithms to make internal estimates 
of the error on the two-point correlation function. We exam- 
ine the scatter in the internal error estimates by comparing 
results obtained from different data sets. In our analysis, 
we vary the number of sub-samples each data set is divided 
up into, the sampling rate of the particles, the number of 
sub- volumes selected to construct a copy of the data set in 
the case of the bootstrap and also show results for cluster- 
ing measured in real and redshift-space. Our benchmark in 
this study is the external estimate of the error obtained by 
treating our data sets as 100 independent experiments. This 
is regarded as the "truth" in our paper. 

This section is quite long and rather detailed. For that 
reason, on a first reading of the paper the reader may wish 
to go directly to the summary in Section [U] or to the case 
study in Section O In Section [4. II we present the correlation 
function measurements made from the ensemble of data sets. 
We then go through different aspects of the error analysis: 
in Section [4.21 we compare the relative errors obtained from 
the different techniques; in Section|4]3]we look at the uncer- 
tainty in the error; in Section B^ and Section r4.5l we examine 
the distribution of eigenvalues and eigenvectors respectively 
of the covariance matrices produced by each approach, be- 
fore ending in Section [4.61 with a discussion of the stability 
of the inversion of the respective covariance matrices. 

4.1 Correlation functions 

We present results for the projected correlation function 
and the spherically averaged correlation functions in both 
real and redshift space. Selected results later on in the Sec- 
tion depend on the number of bins used to represent the 
correlation function. In most plots we focus on the scales 
1.0 < r / /i"^ Mpc < 25 for the following reasons. First, this 
range is accurately modelled in the L-BASICC simulations, 
and can be measured reliably within all of the sub- volumes 
the data sets are split into. Second, the model fitting pre- 
sented in Section [S] is only valid down to ~ 1.0 Mpc, 
and it therefore makes sense to present other results over 
the same range of scales for comparison. Third, this range 
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Figure 1. Left: The spherically averaged correlation function of the dark matter, estimated in real (blue) and redshift space (red) 

for 100 data sets extracted from the L-BASICC simulation ensemble, as described in i|3.1l Right: The projected correlation function, 
Wp(rp), of the dark matter estimated directly in real space and in redshift space by integrating ^x(''pi''") for the same data sets as used 
in the left panel. The projected correlation functions are integrated out to two limits: TTmax = 64 & 30 Mpc, as indicated by the 
legend. For clarity, the redshift space estimate integrated out to TTmax = 30 Mpc is omitted. In both panels, the error on the mean 
of the quantity is plotted. See i|4.1l for comments. 



is the most appropriate to draw direct analogies with more 
general analyses of the error in galaxy clustering measure- 
ments. On sub-Mpc scales the precise way that galaxies are 
distributed within their host halo strongly influences the cor- 
relation function and its error. In real-space, scales larger 
than ~ 20 /i"^ Mpc are even more problematic as estimates 
can be affected by systematics at the ~ 30 % level (see 
Fig. [l]), if special care is not taken when accounting for red- 
shift space distortions. Fourth, this choice of scale produces 
clustering estimates that can be split into sufficiently small 
but still well sampled bins. 

We first present measurements of the basic correlation 
functions we will be working with throughout the remain- 
der of the paper. Fig. [T] shows the spherically averaged cor- 
relation functions, ir{s) and ^s(s), and the projected cor- 
relation function, Wp(rp)/rp, in the left and right panels 
respectively, as estimated from the 100 independent data 
sets of 380 Mpc aside. In the left panel of Fig. [T] we 
see the clear impact of redshift space distortions on the 
measured ^s(s). On small scales, the clustering signal is 
severely damped by the peculiar motions of the dark matter 
particles (responsible for the fingers-of-God due to cluster- 
mass haloes in a ^s('"p,7r) representation). On intermediate 
scales, large scale infall enhances the clustering amplitude 
compared with the real space measurement (see Peacock & 
Dodds 1994 for a model of the small and large scale redshift 
space distortion). £,r{s) shows the well known shoulder be- 
tween 1 and 3 Mpc (Zehavi et al. 2004), a sign of the 
transition between the one and two halo terms used in halo 
occupation distribution models (e.g. Berlind et al. 2003). 

In the right hand panel of Fig. [T] we show the projected 
correlation functions, Wp(rp)/rp, in real and redshift space. 



By real-space, here we mean that we integrate over an esti- 
mate of ^r('"p,7r) made without including peculiar motions. 
The comparison between these results illustrates the sensi- 
tivity of the projected correlation function to the choice of 
TTmax iu Eq. [5] Theoretically, if one integrates out to vTmax = 
oo, Eq.[S] would return a purely real space quantity. For nu- 
merical as well as computational reason^ we restrict the in- 
tegration to TTmax < 64 Mpc. Eveu at rp ~ 10 Mpc 
there is a systematic difference of 10 % in this case between 
Wp(rp)/rp estimated for data in real and redshift space. This 
difference increases with scale, and by ~ 30 Mpc it is 
greater than 50 %. Due to the large sub-volumes used this 
systematic effect is statistically non-negligible. Taking the 
individual errorbars at face value (which we show below to 
be ill-advised), at ~ 20 Mpc this difference is about 
1-cr. For comparison purposes, we plot Wp(rp)/rp estimated 
in real space data using vTmax = 30 Mpc (cyan). Even 
in real space, the chosen value of vtmax has non-negligible 
consequences. 

The fact that the real and redshift space measurements 
of the projected correlation function do not perfectly agree 
in the right hand panel of Fig. [T] has important implications 
for the interpretation of Wp(rp). An estimate of the pro- 
jected correlation function made from data in redshift space 
will still be influenced by the underlying redshift space dis- 
tortions if the integral is not performed out to large enough 
scales. In our analysis, the magnitude of the systematic shifts 

The larger the integration range, the more computational 
resources are needed to perform the pair count to estimate 
Cx(t'p,tt), since the number of pairs increases in proportion to 
the additional volume included as tt increases. 
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Figure 2. The mean relative variance of §r(s) (loft), (middle) and Wp(rp) (right) as a function of scale for different error estimators: 

black - 100 "mock" data sets; red - two different samples of the mocks (the first 50 & the last 50); green - bootstrap errors after splitting 
each data set into 27 sub-samples (i.e. Boot-27 v^ith N,-=Ng^i,); blue and cyan — jackknife errors measured after splitting each data set 
into 27 and 64 sub-samples respectively (i.e. Jack-27 and Jack-64). See i|4.2l for comments. 



is calculated for simulated cold dark matter particles. In the 
case of real galaxies, some of which are strongly biased with 
respect to the underlying dark matter, the discrepancy could 
be even larger. One should therefore be cautious about con- 
sidering Wp(rp) as a "real space" measurement, and take 
even more care in understanding the effect of the limiting 

TTmax Value. 



4.2 Mean relative variance as function of scale 

In Fig.[2]we present the mean relative variance of £,r{s) (left), 
^s(s) (centre) and Wp(rp) (right), all as function of scale for 
different error estimators. The mean relative variance for 
each statistic is defined in the top left corner of each panel. 

The relative variance on the clustering measurements 
estimated from the scatter over the 100 data sets is robust 
(solid black) . If we consider the first 50 and last 50 data sets 
as separate ensembles (solid red) , we find very similar results 
for the relative variance. However, for all three statistics con- 
sidered (^r(s), ^s{s) and Wp(rp)), neither the bootstrap nor 
jackknife is able to fully recover the relative variance of the 
mocks as function of scale. Bootstrap estimates for which the 
number of sub- volumes selected, Nr, is equal to Nsut tend to 
systematically overestimate the relative variance by 40-50 % 
on all scales. Jackknife estimates show a scale and statistic 
dependent bias, varying from no bias on scales larger than 
~ 10 Mpc, to a 25 % overestimate for Wp(rp), and as 
much as 200 to 400 % for ^r(s) and 5s(s) on sub- Mpc scales. 
We have checked that these differences are not due to a sys- 
tematic uncertainty in the recovered mean of each statistic: 
even when rescaling the variance by a simple power-law, the 
conclusions remain unchanged. 

A closer look at the data set copies generated here with 
the bootstrap method reveals that the mean efi'ective vol- 
ume of the copies is only ~ 63% of the volume of the original 
data set (i.e. in the case where the number of sub-samples 
selected at random with replacement equals the number of 
sub-samples the data set is divided into, Nr=Nsut)- Perhaps 
somewhat surprisingly, this value for the mean effective vol- 
ume depends only weakly on the number of sub- volumes into 



which the original data set is divided: in fact, for a data set 
split into a large enough number of sub-volumes the mean 
is insensitive to the exact Ns^i, valu^fl- This reduction in 
effective volume could, in principle, explain why the mean 
relative variance is systematically larger for the bootstrap 
than that expected from the mocks. To increase the effective 
volume of the bootstrap copy, we need to sample more sub- 
volumes, i.e. oversample the number of sub-volumes w.r.t. 
the standard case Nr—Nsub- We have experimented with 
oversampling the number of sub- volumes 2, 3 and 4 times 
(lower half of Table [TJ. As expected, the mean relative effec- 
tive volume does tend then towards unity: for Nr =2 Nsub, 
3 A^suf, and 4 Naub, the mean relative effective volumes are 
~ 87%, ~ 95% and ~ 98% respectively. The effect of this 
on the mean relative variance is discussed further in H5.5I 

The shapes of the relative variance computed using the 
different methods are different as a function of scale. While 
ar for ^r{s) and £,s{s) show a clear minimum between 3-6 and 
2-3 Mpc respectively, ar for Wp(rp) shows no evidence for 
an increase in the relative variance on moving to the smallest 
scales considered. This dependence could be related to the 
number density of dark matter particles used in the analysis, 
with ^r{s) and ^a{s) being maybe more sensitive to Poisson 
noise on small scales than the integral statistic Wp(rp). The 
flattening in the relative variance seen in the mocks for (s) 
corresponds to the scale where the dominant contribution 
to the correlation function changes from the one-halo term 
(small scales) to the two-halo term. 

It is important to remember that it is not possible from 
Fig. [5] alone to quantify the inevitable correlations between 
different scales in the clustering statistic. It would be naive 
to assume that these are unimportant, but let us here for 
simplicity consider that naive situation of uncorrelated er- 
rors. In the case of jackknife errors, the different scale depen- 
dence of the relative variance (compared with the "truth") 

^ Whilst the mean effective volume is insensitive to the number 
of sub- volumes the data set is divided into, the form of the overall 
distribution of relative effective volumes does depends on Ng^i,, 
as the only possible values are all integer fractions of N^y^i,. 
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Figure 3. The distribution of projected correlation function measurements for different estimators: left: mock, centre: jackknife and 
right: bootstrap. We plot the distribution of the logj^Q of the measurement divided by the mean for that scale; different colours show the 
distributions for different scales, as indicated by the legend on each panel. In the case of the mocks, the mean is the average over the 
100 data sets. For the jackknife and bootstrap, the mean is averaged over the resamplings of each data set. The dotted lines show the 
corresponding Gaussian distribution, with same mean and variance, the latter indicated on the right in each panel. In this plot, we show 
results for the jackknife and bootstrap for data sets divided up into 27 sub- volumes. 



makes their interpretation more complicated and therefore 
more likely to lead to misleading conclusions, especially if 
the small scale clustering is included. On the positive side, 
the jackknife errors, as presented here, tend to overestimate 
the true errors, which is certainly better than underestimat- 
ing them. In the case of the bootstrap errors, on the other 
hand, the relative variance as a function of scale has a simi- 
lar shape to that of the variance from the mocks, and so the 
variance could be rescale by an appropriate factor to take 
into account the overall overestimate of the relative variance. 
As with jackknife errors, the bootstrap errors, as presented 
here, do not underestimate the errors on any of the scales 
considered, which is a good thing if one is to consider an 
uncorrelated error analysis. 



4.3 Distribution of correlation function 
measurements 

Having compared the estimates of the relative variance of 
the correlation function in the previous section, let us now 
look at the distributions of the measurements of the corre- 
lation functions themselves. The motivation for doing this is 
clear. Recall that in order to be able to interpret fitting 
in the standard way, assigning the confidence intervals asso- 
ciated with a Gaussian distribution to set differences in x^j 
we need to find a quantity which is Gaussian distributed to 
use in the fitting process. 

The distribution of the measured projected correlation 
functions are compared in Fig. [S] Each panel corresponds 
to a different error estimation method (left: mocks; centre: 
jackknife; right :bootstrap). We plot the logj^Q of the mea- 
sured projected correlation function divided by the mean 
value, which shows deviations on either side of the mean 
on an equal footing. The different histograms show the dis- 
tributions of logjQ Wp(rp)/<Wp(rp)> measured on different 
scales covering the range 0.60 to 25 Mpc. The dotted 
curves show the corresponding Gaussian distributions, i.e. 
with same mean and variance. 



A quick comparison of the shapes of the histograms 
in the different panels shows that the distribution of 
logj^Q Wp(rp)/<Wp(rp)> appear Gaussian for all estimates 
on small scales, while on larger scales, this is no longer true 
for jackknife and bootstrap estimates (rp > 5 Mpc and 
Tp > 15 Mpc respectively). On the largest scales con- 
sidered here, even the distribution from the mocks seems 
to be boarder than its corresponding Gaussian equivalent. 
The distributions yielded from the jackknife estimator tend 
to show asymmetry on most scales plotted. Thus, despite 
these deviations from a Gaussian distribution, we conclude 
that logjg Wp(rp) is closer to being Gaussian than Wp(rp) is, 
and, as such, is a more appropriate choice of variable to use 
in a fitting. 

The middle panel of Fig. [3] shows the distribution of 
the jackknife Wp(rp) estimates. Note that, as a result of 
the method of construction, jackknife errors are much more 
highly correlated than all the others, which is reflected by 
the additional factor of A'^ — 1 in the numerator of the expres- 
sion for the covariance matrix (Eq.[2]). Hence, to compare the 
jackknife distribution with the others, we first need to rescale 
the ratio Wp(rp) by \/Nsub — 1- Similarly, it is essential to 
account for this factor when plotting jackknife errorbars on 
data points, as otherwise they do not correspond to the vari- 
ance in the jackknife estimates. A quick comparison of the 
middle and left panels in Fig. [3] shows that the distribu- 
tion of jackknife estimates over this medium range of scales 
is similar to the corresponding mock distribution: the jack- 
knife distribution is slightly wider than the mock on small 
scales, which agrees with the comparisons of the variances 
presented in H4.2I However, this figure also shows a concern- 
ing trend in that the jackknife distribution tends to become 
more asymmetric on larger scales (rp > 10 Mpc), imply- 
ing that jackknife errors for logj^Q Wp(rp) are not Gaussianly 
distributed on such scales. 

Finally, in the right panel of Fig. [3] we show the dis- 
tribution of bootstrap Wp(rp) measurements. On all scales 
considered the distribution is clearly wider than that ob- 
tained from the mocks. This highlights the fact that, when 
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Figure 4. Left : The normalized eigenvalues (i.e. normalized relative variance contributions) for the redshift space correlation function, 
£,s{s), as function of eigenvalue number. The black line shows the eigenvalues for 100 mock data sets, while the two red lines correspond 
to the eigenvalues obtained from the covariance matrix obtained using the first and last 50 mocks only. As in Fig. [2] this illustrates the 
scatter in the mock eigenvalues. The green line shows the equivalent result for the Boot-27 errors, while the blue and cyan lines show 
the Jack-27 and Jack-64 errors respectively. Errorbars indicate the l-cr scatter on applying the error estimates to the 100 data sets. The 
range of pair separations considered in the principal component decomposition is given in each panel. Right: Same as left panel but for 
the projected correlation function, Wp(rp), estimated from the simulations put into redshift space. 



only considering the diagonal terms from the covariance ma- 
trix (as is commonly done), bootstrap errors are generally 
larger than both mock and jackknife errors, as was clear from 
i]4.2l already. However, this does not necessarily mean that 
bootstrap errors overestimate the uncertainty, as we need 
to take into account the correlation between errors as well. 
Clearly, in an analysis which ignores the correlation of the 
errors, the errorbars will be overestimated using the boot- 
strap method. It should be stressed that the values of logjQ 
Wp(rp) obtained from the bootstrap, unlike those from the 
jackknife, are certainly more Gaussianly distributed, with 
perhaps some hint of a non-Gaussian error distribution ap- 
pearing only for Tp > 25. 



4.4 Eigenvalues 

So far we have ignored any correlations between bins when 
presenting the errors. To correct for this omission, we now 
consider the first step in a principal component decomposi- 
tion of the covariance matrix, the distribution of normalized 
eigenvalues. The normalized eigenvalues quantify the contri- 
bution to the variance of the corresponding eigenvector. 

In the two panels of Fig. |4] we present the normalized 
eigenvalues for the covariance matrix of redshift space corre- 
lation function measurements, (,s{s) (left), and for the pro- 
jected correlation function, Wp(rp)/rp (right), in both cases 
as function of eigenvalue number. The eigenvalues are ranked 
in order of decreasing variance, with the first eigenvalue ac- 
counting for the largest variance. The results in this plot are 
dependent on the number of data points used. In this case. 



the correlation functions are estimated in logarithmic bins 
of pair separation with width of 0.2 dex. 

In each panel, the black line corresponds to the mock 
eigenvalues (which we call the "truth"), and the red lines 
indicate the eigenvalues obtained from the covariance matrix 
constructed using just one half of all the mock realizations. 
This provides an indication of the accuracy with which the 
mock eigenvalues are measured. Interestingly, the accuracy 
of the eigenvalues in this test is higher for Wp(rp)/rp (right) 
than it is for ^s(s) (left). This is most likely related to the 
fact that, within the range of scales considered, ^s(s) is more 
sensitive to Poisson noise than Wp(rp)/rp, as was the case 
for the results in Section [4.21 

Fig. 13] shows that in all cases, the eigenvalues decrease 
strongly in amplitude with increasing eigennumber. The first 
two eigenvalues alone typically account for 80 or even 90 % 
of the total variance. This indicates that a very strong cor- 
relation exists between the bins in the correlation function 
measurements. The shape of each eigenvalue curve is de- 
pendent on the correlation function itself. Somewhat sur- 
prisingly, ^s(s) appears to be a less correlated statistic than 
Wp(rp)/rp, as more eigenvalues are needed in the former case 
to represent the covariance matrix with the same level of fi- 
delity i.e. to the same total variance. Or, in other words, 
the eigenvalue curve is shallower for ^s(s) than it is for 
Wp(rp)/rp. Changing the range of scales or the number of 
data points used to represent the correlation function has 
only a marginal impact on the form of the eigenvalue curves. 

Eigenvalues derived from the bootstrap covariance ma- 
trix are shown by the green line in each panel of Fig. |4l with 
error bars indicating the scatter over the 100 mock data sets. 
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Here we show only the bootstrap eigenvalues obtained using 
27 sub-volumes. On average, the bootstrap method recovers 
the expected eigenvalue curve rather accurately. Although 
it is not a perfect match, the "truth" (i.e. mock) is always 
within the scatter presented for both clustering statistics. 
Note that changing the scale or the number of data points 
considered does not appear to influence these conclusions. 

The eigenvalues obtained from the jackknife covariance 
matrix are shown in Fig. |4] for two choices for the num- 
ber of sub-volumes, 27 and 64, shown by the blue and cyan 
lines respectively. Once again, error bars indicate the scat- 
ter in the eigenvalues obtained by applying this technique to 
each of the 100 data sets. Neither of the two jackknife mea- 
surements fully recovers the true eigenvalues. Interestingly, 
the more sub-samples that are used, the further away the 
eigenvalue curves move from the "truth". For ^s(s), this ap- 
pears to be mostly due to the much smaller first eigenvalue, 
which is ~ 0.6 ± 0.1 instead of ~ 0.85 ± 0.1. Furthermore, 
the slope of the eigenvalue curve is different than that ob- 
tained from the mock data sets and is very sensitive to the 
number of sub-volumes the data is divided into. We have 
checked that further increasing the number of sub-samples 
the data is split into simply exacerbates this problem. This 
is a first yet clear indication that the covariance matrix esti- 
mated from the jackknife technique is not equivalent to the 
"correct" mock estimate. However, these issues do not nec- 
essarily imply that the jackknife error estimate is incorrect, 
since the eigenvalues are just one part of the principal com- 
ponent decomposition. On the other hand, this highlights 
the limitation in how accurately the jackknife methodology 
can recreate the true underlying errors, as given here by the 
mocks. Most of the above remarks remain true for Wp(rp)/rp 
as well. As was the case with the bootstrap analysis, chang- 
ing the scale or the number of data points does not signifi- 
cantly influence any of the above conclusions. 

4.5 Eigenvectors 

After considering the eigenvalues of the principal component 
decomposition we now examine the associated eigenvectors. 
Together, the eigenvalues and eigenvectors completely de- 
scribe the full normalized covariance matrix. 

In Fig.[S] we show the first five eigenvectors derived from 
the covariance matrices constructed for the spherically aver- 
aged correlation function (^s(s), left), and for the projected 
correlation function (wp(rp)/rp, right). Note these eigenvec- 
tors are in the same order as the eigenvalues presented in 
Fig. 13] Only the eigenvector corresponding to the smallest 
eigenvalue is not shown, which contributes less than ~ 0.3 % 
to the total variance. The colour coding remains the same 
as that used in Fig. 3] with the contribution to the relative 
variance indicated in the bottom right corner of each panel. 
At first sight, the results are rather encouraging, as the mean 
eigenvectors for all error calculation methods overlap reason- 
ably well with that measured from the mocks (the "truth" , 
shown by the black line). The scatter on the mock result 
is indicated by the spread between the red lines, which, as 
before, shows the results obtained from splitting the mock 
data sets into two groups of 50. 

Let us consider each eigenvector for both 5s(s) and 
Wp(rp)/rp more carefully. The first and most important 
eigenvector (top left corner of both panels in Fig. (5)1 is very 



flat and is related to the uncertainty in the mean density 
of objects in the data sets. This uncertainty causes all the 
clustering measurements to move up and down more or less 
coherently. To within the quoted scatter on each clustering 
measurement, all of the first-ranked eigenvectors are iden- 
tical. However, some interesting trends can already be seen 
in this decomposition. For example, increasing the number 
of sub-samples used from 8 to 27 increases the difference 
between the mock and jackknife estimates, with the shape 
of the data inferred eigenvector tending to show less (more) 
correlation on small (larger) scales. This is a real effect that 
is further enhanced as one considers even smaller scales than 
shown here. 

The second eigenvector in Fig. [5] shown in the top mid- 
dle row of both panels, displays a strong scale dependence 
unlike the first eigenvector. The second eigenvector gives a 
much smaller contribution to the total variance, i.e. around 
the ~ 10 % level, as opposed to ~ 85 % for the first eigenvec- 
tor. The form of the second eigenvector reveals that small 
scales are anti-correlated with larger ones. It is worth noting 
that all three error methods yield eigenvectors which look 
very similar for both 5s(s) and Wp(rp)/rp. Increasing the 
number of sub- volumes decreases slightly the scatter in the 
recovered second eigenvector, and for a fixed number of sub- 
samples the scatter is marginally smaller for the bootstrap 
error estimates. Finally, it is worth noting that the correla- 
tions as function of scale, despite having different slopes for 
^s(s) and Wp(rp)/rp, are in fact very similar: the orientation 
of the eigenvectors is only determined up to a sign, so all 
the curves in Fig. [S] can be arbitrarily multiplied by -1. 

The remaining three eigenvectors plotted in Fig. [5]com- 
bined contribute less than a few percent of the total variance, 
and the smaller their contribution to the variance, the larger 
is the scatter from the different resamplings. The fifth eigen- 
vector (bottom right panel) certainly tends to be dominated 
by point to point variations in the data itself. This behaviour 
is particularly obvious when most of the eigenvector signal 
appears to come from adjacent points with opposite corre- 
lations, as seen in the lower panels for both statistics. 

Note that it is precisely this last point, whether or not 
the eigenvector provides a useful or important description of 
the data, that highlights the difficulty behind using a prin- 
cipal component analysis. When fitting a model to the data, 
we need to select how many principal components to use in 
the fit (see the example in Section [SJ. If features in the data 
are real then any theoretical model should attempt to repro- 
duce them, and the eigenvectors which encode these features 
should be retained. However, if any facet of the data is dom- 
inated by noise then we should not attempt to reproduce 
it, and we should omit the corresponding eigenvector. One 
can compare covariance matrices without worrying about 
whether or not to retain different eigenvector components; 
however, this issue has a major bearing on the success of 
model fitting. 

4.6 Stability of (inverted) covariance matrices 

So far we have compared the principal component decom- 
position of covariance matrices constructed using different 
techniques to what we know is the "right" answer - the co- 
variance matrix obtained from a set of mock data sets. Un- 
fortunately, in practice, it is rarely the case that one knows 
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Figure 5. Left: The normalized eigenvectors for the spherically averaged redshift-space correlation function, i;{s), as function of scale. 
Each panel corresponds to a different eigenvector, whose contribution to the total variance (in the mock case) is indicated on each panel. 
Colour and line styles are the same as in Fig. |4] Error bars show the l-cr scatter obtained from the 100 data sets. The range of physical 
scales considered in the principal component decomposition is indicated in the legend. Right: Same as left panel, but for the projected 
correlation function, Wp(rp)/rp, as function of projected separation. 



the correct answer beforehand. It would be useful, there- 
fore, to devise a statistic which will allow us to quantify the 
degree to which a given technique can recover a covariance 
matrix. Clearly no statistic can ever know the precision to 
which a measurement is actually correct, since that would 
require a priori knowledge of the "truth". But the statistic 
should at least be sensitive to the level of noise which re- 
mains in the covariance matrix estimate. In the absence of 
such a statistic at present, we consider the usefulness of a 
few well-known results on the stability of errors in our quest 
to determine the stability of the inversion of the covariance 
matrix. 

One of the easiest tests to check the stability of a covari- 
ance matrix inversion is to repeat the whole analysis using 
only the odd or even data points (i.e. the 1st, 3rd, 5th ... 
data points), and to check that the results remain within the 
quoted errors, and, furthermore, that no systematic shift has 
been introduced in either case. Similarly, the result should 
remain stable to a simple rebinning of the data. Note that 
the rebinning should be done at the level of the data it- 
self and not on the processed results. This is to ensure that 
non-linearities in the binning procedure do not introduce 
unwanted biases. In the case of 2-point clustering statistics, 
this means that pair counts should be rebinned. 

Porciani & Norberg (2006) considered a quantitative 
test for the bootstrap method, which consists of using 
values in order to assess the stability of the recovered co- 
variance matrix. Their method relies on quantifying how 
many components of the principal component decomposi- 
tion can be reconstructed for a fixed number of bootstrap 
realizations. The figure of merit is obtained by comparing 
the distributions for two different sets and numbers of 
bootstrap samples, as a function of the number of princi- 



pal components used. With this method, it is possible to at 
least make sure that the number of bootstrap realizations is 
large enough, but under no circumstance can we tell whether 
the method chosen has converged to the "true" covariance 
matrix. One nice feature of the method is that it can actu- 
ally show that certain components of the decomposition will 
under no circumstance be recovered accurately enough, in 
which case this gives an indication of how many components 
should actually be considered. 

Last but not least, the results should be stable w.r.t. 
the number of sub-samples considered (within moderation 
of course). This is something which has to be considered in 
both bootstrap and jackknife analyses, and remains proba- 
bly one of the better ways to at least attempt to show that 
the results obtained with internal errors estimates are as 
accurate and realistic as possible. 

In SectionO we put the above remarks into practice and 
show that each of them are able to fix or highlight different 
and often problematic issues in the error analysis using in- 
ternal estimates. We note here that it is due to the numerous 
mock realizations that we are actually able to discover, on a 
more rapid timescale, which of the different methods is the 
more promising. Certainly the task would be much more 
difficult in their absence. 



5 A CASE STUDY: COSMOLOGICAL 
PARAMETER ESTIMATION 

In this section we illustrate the different error estimation 
methods introduced and discussed in Sections[2]and|4]with a 
case study involving the projected correlation function. The 
aim is to infer two basic cosmological parameters, the mat- 
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Figure 6. Left: The projected dark matter correlation function, estimated from a set of non-linear power spectra with Co = 0.25 and 
0.30 < eg < 2.00, in steps of (5(Tg = 0.02 (see text for details). Like the results presented in Section [4.1l the projected correlation functions 
are calculated at ^ = 0.50. Right: The same as the left panel, but now assuming erg = 0.90 and 0.05 < r2o < 0.99, in steps of <5f2o = 0.02 
(see text for details). 



ter density, f2o, and the amplitude of density fluctuations, 
cTg, using measurements of the projected correlation function 
made from the mock data sets between 1 and ~ 22 h~ Mpc. 
In this example, because we are using data sets extracted 
from N-body simulations, we know the true values of the 
parameters beforehand, and can therefore test the perfor- 
mance of the error estimators. 



5.1 A grid of theoretical Wp(rp) models 

We first describe how to compute a grid of models for the 
projected correlation function using linear perturbation the- 
ory. First, we construct a grid of linear perturbation the- 
ory power spectra, using the parameterization of the cold 
dark matter transfer function introduced by Eisenstein & 
Hu (1998). This parameterization is an approximation to 
the more accurate results generated from Boltzmann codes 
such as CAMB (Lewis & Chalhnor 2002; see Sanchez et al. 2008 
for a comparison) . The initial conditions for the L-BASICC 
simulations were computed using CAMB, so we would expect a 
small systematic error in the values of the recovered cosmo- 
logical errors. However, for this application, the measure- 
ment errors are much larger than these systematic errors, 
and we use the Eisenstein & Hu (1998) equations for speed 
and simplicity. 

The parameters varied to construct the grid are the nor- 
malization of the power spectrum, erg (over a range = 0.3 
to 2.0 in steps of 0.02) and the present day matter den- 
sity parameter S7o (covering the range = 0.05 to 1 in steps 
of 0.02). The other cosmological parameters are held fixed 
at the values used in the simulation, as described in Sec- 
tion [3TT] The power spectra are output at z = 0.5 to match 
the redshift of the simulation output. The next step is to 



produce an approximate estimate of the non-linear matter 
power spectrum, using the HALOFIT code of Smith et al. 
(2003), which has been calibrated against the results of N- 
body simulations. Finally, we transform the power spectrum 
into the projected correlation function. The relation between 
the real-space correlation function, £,r{r), and the projected 
correlation function is well known: 



Using the fact that the power spectrum is the Fourier trans- 
form of the correlation function, we can replace £,r{r) to 
obtain, in the limit that TTmax — * oo: 

^!^A[A = — !— /°°fcP(fc)Jo(fcrp)dlnfc . (15) 
rp 27rrp Jg 

Note that if a finite value of TTmax is used to obtain the mea- 
sured projected correlation function, then strictly speaking, 
Jo in Eq. [15] should be replaced by a finite integral over 
sin(fcr)/(fcr). Instead we correct for the small systematic er- 
ror introduced by retaining vTmax — + oo by forcing the the- 
oretical prediction for flo = 0.25 and erg — 0.9 to agree 
with the measurement from the full simulation volume; this 
correction is applied to all the projected correlation func- 
tions on the grid and works well, which is clear since the 
returned best fitted values are still centred on the expected 
values. The dependence of the projected two-point correla- 
tion function on Qq and erg is illustrated in Fig. [6] The left 
hand panel shows the dependence of Wp(rp)/rp on erg for f2o 
fixed to the value used in the L-BASICC simulation, while 
the right hand panel shows the dependence on Qq with erg 
fixed to the simulation value. 
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Figure 7. Ax^ = 6.17 contours (i.e. 95 % confidence interval 
for a 2-parameter fit in the case of Gaussian distributed errors) in 
the flo-ag plane, for three hundred and one error estimates. The 
most realistic error estimate is given by the mock errors (thick 
black), while red (dotted) correspond to bootstrap errors with 27 
sub-samples, blue (short dashed) to jackknife with 27 sub-samples 
and cyan (long dashed) to jackknife with 64 sub-samples. From 
this rather 'messy' plot, it is clear that internal error methods 
do not necessarily return the correct errors. The fits projection 
correlation function fits are done over the scales indicated in the 
legend. 

5.2 A straightforward 2-parameter model fitting? 

Armed with our grid of theoretical models we are now ready 
to test the internal error estimators against the external es- 
timate, the mocks, which we regard as the "truth" . We note 
that this is an idealized case, which is perfect for assessing 
the performance of the internal estimators: we know the true 
values of the cosmological parameters and we know that our 
model should provide a very good description of the mea- 
surements, at least over the range of scales used in the fit. In 
a more realistic situation, the complications of galaxy bias, 
sample selection and how well we could model these effects 
would have an impact on performance (see for example An- 
gulo et al. 2008). 

As this is such a straightforward fitting problem, we 
choose in Fig. [7] to plot the raw 95 % confidence interval 
contour^ in the Qq-cs plane for three hundred and one error 
estimates. Remember that in a normal situation with 1 data 
set, we would only be able to obtain three estimates of the 
error, whereas in our case we have 100 data sets and can 
therefore study the distribution of internal error estimates. 
The black solid contour corresponds to the 95 % confidence 

^ In this section, we will refer to confidence interval levels as if 
the errors are Gaussianly distributed. Technically, we consider the 
confidence intervals (CI hereafter) which correspond to Ax^ = 
2.31 or 6.17 for 2-parameter fits, and sometimes to Ax^ = 1 or 4 
for 1-parameter fits. 



interval inferred from the mocks, which we consider as the 
"truth" or the benchmark error that the other methods are 
trying to match. The red-dotted lines correspond to Boot- 
27 (with Nr—Nsub), blue-short dashed to Jack-27 and cyan- 
long dashed to Jack-64. The rather messy nature of this 
plot tells us that there is little agreement between each of 
the error methods and even between different realizations 
using the same estimator. 

Two generic trends can be seen in Fig. [T] First, not 
all contours are centred on the true values used in the sim- 
ulations of Qq = 0.25 and erg = 0.90, but appear to be 
systematically shifted around. This is the effect of sample 
variance, also more commonly (but less correctly) referred 
to as cosmic variance. Indeed, each data set, despite being 
very large (a cube of side 380 Mpc), is sensitive to the 
large scale density fluctuations present in the much larger 
simulation volumes. Irrespective of the error method used, 
the distributions of best flt fio values are all symmetric, but 
not exactly Gaussian (the central part being slightly more 
peaked than a Gaussian with similar dispersion) and with 
mean values less than a 5r2o-binsize away from the simula- 
tion value. The distributions of best fit as values are all very 
well described by Gaussian distributions, with similar dis- 
persion and with mean values within a 5crg-binsize from the 
ICC1340 simulation value. Second, some of the error con- 
tours from the internal estimators are much larger than the 
mock error contour, while others are clearly much smaller. 
This, on the other hand, is of concern for internal error es- 
timates and merits therefore closer investigation. 

To further quantify the conclusions drawn from Fig. [T] 
we construct a relative area statistic, defined as the area 
encompassing the 95 % confidence interval in an internal 
error estimator divided by the area of the external, "true" 
mock error estimate, A / Amock, for the same number of 
principal components. The motivation behind this statistic, 
hereafter also referred to as the normalized figure of merit, 
is twofold: 1) for a 2 parameter model it is natural to make 
comparisons between different error estimators using the full 
parameter plane; and 2) due to its dimensionless nature, this 
statistic is easily interpreted in terms of confidence levels 
(see below). A drawback of this figure of merit is that it 
does not account for uncertainties in the determination of 
the mock area/contours, which, given the number of data 
sets available, can not be neglected. A zeroth order estimate 
of that effect is given by a comparison between the following 
three mock results: those obtained using all 100 mocks, the 
first 50 and the last 50 respectively. Additionally, it is not 
possible to estimate this statistic for real data, but only for 
slight variants of it. 

In Fig. [8] we plot the distribution of the relative area 
statistic. The case of Boot-27 (with Nr=Nsub) is shown in 
red, results for Jack-27 and Jack-64 are shown in blue and 
cyan respectively, while the black vertical line indicates the 
optimal value of the mock. The arrows show the median of 
each distribution. The difference between the two panels is 
the binning used for Wp(rp)/rp, and hence, indirectly, in the 
number of principal components considered. 

Focusing on the left panel of Fig. |8] first, we see that 
for most data sets, the internal error estimates (of which 
three hundred are presented in this figure) tend to overes- 
timate the 95 % CI area, on average by factors of ~ 1.9, 
~ 1.3 and ~ 0.95 for Boot-27, Jack-64 and Jack-27 respec- 
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Figure 8. The distribution of tlie normalized figure of merit, A / J^modk (see il5.2l for details). Boot-27 results are shown in red, Jack-27 
and Jack-64 are in blue and cyan respectively. The arrows indicate the median of each distribution, while the dotted curves show the 
corresponding Gaussian with mean and variance indicated in the panel. In the left (right) panel, the number of principal components 
used is six (thirteen) out of the six (thirteen) data points used to cover the range of scales considered. 



tively. Furthermore, the variation in the area of the error 
contours is substantiah the central 68 % of values is typi- 
cally spread over 0.15 in logio A / Amock, i.e. a factor of 
~ 1.4 in the area of the error contour. Hence, even for this 
particularly simple case, the uncertainty on the internal es- 
timate of the CI is large and certainly not negligible. As seen 
in earlier cases, the difference between Jack-27 and Jack-64 
is also quite large, with Jack-64 yielding a marginally more 
centrally concentrated distribution than Jack-27, whereas 
Jack-64, on average, overestimates the true area by an ad- 
ditional 30 %. The situation is clearly worst for bootstrap 
errors, which display by far the largest systematic offset, but 
with a spread that looks amazingly similar to jackknife re- 
sults. Interpreting these offsets in the framework of Gaussian 
errors for this figure of merit, we conclude that on average 
a 95 % boot-27 CI corresponds effectively to a 99.6 % CI, a 
95 % Jack-64 to a 98.0 % CI, and a 95 % Jack-27 CI to a 
~90 % CI. This is without taking into account the spread 
in relative areas from different realizations. 

Unfortunately, this is not the end of the confidence in- 
terval story. We now consider the right hand panel of Fig. |8l 
which is equivalent to the left except that 13 bins instead 
of 6 have been used for Wp(rp)/rp (over the same range of 
scales) and hence all 13 principal components are used in 
the fitting, instead of just 6 as in the left panel. The dif- 
ference in the figure of merit between right and left panels 
is clear, with the offsets of two of the three error methods 
changing in a systematic way. The right panel area ratios 
are biased by factors ~ 1.5, ~ 1.3 and ~ 0.7 for Boot-27, 
Jack-64 and Jack-27 respectively. The only positive note is 
that all distributions plotted are well described by Gaus- 
sians, shown by the dotted lines, with small dispersions. In- 
directly, we conclude that on average an estimated 95 % 
CI corresponds, with 95 % confidence, to a true CI in the 



range 79.2 % - 99.8 % for the case considered here. We note 
that only Jack-64 seems to remain stable w.r.t. the change 
in binning. This is most likely related to the results found 
by Hartlap et al. (2007). They showed that internal error 
estimators tend to be systematically biased low on average 
when the ratio of the number of bins considered to the num- 
ber of samples becomes too large (typically 20 % or more) . 

The differences between the distributions of error con- 
tour areas shown in the left and right panels of Fig. |S] is 
disconcerting, since only the number of data points used 
has changed (i.e. the binning of Wp(rp)/rp over a fixed range 
of scales). For mock errors, we measure an increase of 15 % 
in the area covered by the 95 % confidence intervals when 
changing from 6 to 13 data points. This is much smaller 
than the systematic shifts displayed by the internal error es- 
timates under the same circumstances, which present a 30 % 
to 80 % change. It is unlikely that these strong shifts are due 
to the nature of the data alone, but much more likely to be 
due to errors propagating through the full covariance matrix 
analysis. In both panels, the fits are performed using the full 
covariance matrices, and therefore we investigate in the next 
section if noise in the covariance matrix is the root of the 
problem. 



5.3 Dependence of confidence intervals on Npca 

Perhaps the large variation in the error contours returned 
by the internal estimators is due to noise in the covariance 
matrix, resulting from too many principal components being 
retained in the analysis. 

To test this idea, we show in Fig.[5]the impact on the rel- 
ative area statistic of varying the number of principal com- 
ponents for a fixed number of data points. In the left panel 
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Figure 9. The 16"^, 50"^ and 84**^ percentiles of tlie relative area statistic (see i|5.2l for a definition) as function of the number of 
principal components used in the fit: in each case the figure of merit is normalized to the mock area with the same number of principal 
components. Bootstrap errors from 27 sub-volumes (with Nr=Ng^i,) are shown in red, while jackknife errors are given in blue and cyan 
for 27 and 64 sub- volumes respectively. In the left (right) panel, six (thirteen) equally spaced data points for Wp(rp)/rp are considered 
in the model fitting and cover the full range of scales indicated in the key. Finally, the black line indicates A^^^^^/A^^^^Jj, with N^cf = 3 
in both panels. 



of Fig.ini 6 data points are used to represent Wp(rp)/rp. The 
variation of the 95 % confidence interval area in the exter- 
nal mock estimate w.r.t. tfie area with Npca = 3 is plotted 
as a black line. Tliis is a measure of how stable the mock 
results are as function of the number of principal compo- 
nents. There is a modest reduction in the size of the error 
contour, ~ O.ldex, on using 6 principal components instead 
of 3, as shown by the shallow slope of the line. The situation 
is slightly more unstable in the right panel (where 13 data 
points are used instead of 6 as in the left panel), with up 
to a 40 % change over the full range of numbers of prin- 
cipal components considered. This variation is significantly 
reduced if one starts with at least 5 principal components, 
where the change in the figure of merit over the full range 
is then less than ~ 20 %. 

In Fig. [5] we also show the median, and 16* and 84"^ 
percentiles of the relative area distributions for Boot-27 
(with Nr=Nsub), Jack-27 and Jack-64 estimates in red, blue 
and cyan respectively. The impression gained from the two 
panels is rather different, especially for Boot-27 and Jack-27. 
While the main properties of the relative area distributions 
for those two estimators remain roughly the same as function 
of Npca in the left panel (very weak dependence on Npca), 
they change systematically and significantly with Npca in 
the right panel. These changes result in unstable error esti- 
mates, most likely due to the propagation of a bias through 
the covariance matrix estimate (see Hartlap et al. 2007). 
Only Jack-64 seems to return a somewhat stable results as 
function of Npca- 

Finally, in Fig. [TO] we examine the fraction of "outlier" 
experiments as function of the number of principal compo- 
nents used in the fit. An outlier is defined as a data set 



for which the error contour from an internal estimate (de- 
fined by Ax^ ~ 6.17) does not include the true underlying 
cosmology (i.e. Qq ~ 0.25 and as ~ 0.90). For a Gaussian 
distribution, we would expect no more than 5 "outliers" de- 
fined in this way out of 100 data sets. The left panel looks 
reasonable for most Npca, especially considering the number 
of realizations available: remember we have just 100 data 
sets at our disposal, so the Poisson noise on the expectation 
value of this statistic is non-negligible. There is still a ten- 
dency from this panel to say that Boot-27 (with Nr—Nsub) 
overestimates the errors and Jack-27 underestimates them. 
This is in relatively good agreement with our conclusions 
from Fig. (5] In the right hand panel the situation is sig- 
nificantly different. First all three estimators present radi- 
cal changes in the number of "outliers" as function of Npca- 
Clearly we have to exclude fits which use too many principal 
components: a cut can therefore be made at about 10 prin- 
cipal components for Boot-27 and Jack-64, while the cut has 
to be made at about 6 or 7 components for Jack-27 already. 

Neither Fig- nor Fig. llOl can be made without the full 
battery of data sets we have, nor without the knowledge 
of what the "truth" is. However, we can nevertheless learn 
some simple tricks which will be useful for the analysis of real 
data- For example, for each internal error estimator we can 
plot the relative area statistic, by using a fiducial reference 
value for the area. If this quantity varies in a significant man- 
ner as function of Npca or as a function of the internal error 
estimator used, we are made aware of a change of regime, 
without necessarily knowing what to do- The most likely 
and robust solution will be to introduce less components in 
the fit- 
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Figure 10. The fraction of outliers as function of tlie number of principal components used in the fit. An outlier is defined as a 
2-parameter fit for which the Ax^ = 6.17 contour of the data inferred errorbar does not enclose the true underlying cosmology, i.e. 

= 0.25 and a-g = 0.90 (see i|3.1l l. The horizontal line corresponds to the theoretical expectation for the contour Ax^ = 6.17, i.e. 95 % 
assuming Gaussianly distributed errors. The left (right) panel corresponds to bins of 0.2 (0.1) dex width. 



5.4 Important considerations for error analyses 

We summarize here a few results from our simple case study. 
To recap the exercise consisted of fitting the projected cor- 
relation function (on scales between 1 and ~ 22 Mpc), 
using what we knew beforehand was the correct theoretical 
model and which was only dependent on the parameters f2o 
and (Tg, to one hundred projected correlation function esti- 
mates, taken from one hundred data sets of 380 Mpc on 
a side, extracted from 50 totally independent N-body sim- 
ulations of 1340 /i"^ Mpc on a side. The main conclusions 
for internal error estimates are, with confidence interval ab- 
breviated as CI: 



• the average systematic offset encountered on an esti- 
mated 2-a CI implies it can be mistaken for a 1.6-a to 2.9-(7 
CI in reality. 

• the spread in errors for all internal error estimates is 
large: for an unbiased estimator, a 2-(t CI corresponds to a 
real CI in the range 1.3-cr to 3.1-(t, at the 95 % confidence 
level. 

• the CI from bootstrap errors, estimated with Nr=Nsub, 
tend to be systematically more biased than jackknife errors. 

• the true 95 % CI, as measured from one hundred mocks, 
is known to ~ 20 % accuracy. 



The above remarks depend on the mean number density of 
the sample, here fixed to mimic a L* galaxy sample (see 
ij3.1[> . In the next section, we indirectly show how the am- 
plitude of the errors scales w.r.t. the mean number density 
considered. 



5.5 An improvement to the bootstrap recipe 

Throughout our comparisons we have seen that the boot- 
strap gives errors that are systematically larger than the 
"truth" (see for example Figs. [2] and |8]). Previously, we have 
set Nr=Nsub, i.e. the number of sub- volumes chosen with 
replacement was equal to the number of sub-volumes each 
data set is divided up into. This is an arbitrary choice and 
there is no reason why we cannot increase the number of 
sub- volumes used to define each realization of the data set. 
Here we consider varying the number of sub-volumes chosen. 
We also consider increasing the number of sub- volumes the 
data set is split into, which affects both the bootstrap and 
jackknife methods (recall that in the jackknife, the choice of 
Nsub sets the number of realizations we can generate). 

In the left panel of Fig. [11] we show the impact of chang- 
ing the number of sub-volumes chosen on the relative vari- 
ance obtained from the bootstrap with A^sub= 27. Two fam- 
ilies of 5 different coloured curves are shown: the top ones 
correspond to a sampling fraction of 10 % of the standard 
mean number density, while the bottom ones corresponds 
to a 25 % sampling fraction. Each family is composed of 
three bootstrap samples (A'^,.=l, 2, 3 N^ut, in green, blue 
and cyan respectively) and three mock curves: all mocks in 
black, and two red curves for the 50 first and 50 last re- 
spectively. The relative variance decreases as the number 
of sub- volumes selected increases, with the biggest change 
coming when the number of sub-volumes is oversampled by 
a factor of two (blue line). With an oversampling of a factor 
of 3 (cyan line) , the bootstrap errors are in very good agree- 
ment with those derived from the external estimate. If we 
increase the oversampling rate further, the relative variance 
returned by the bootstrap becomes too small (not shown for 
clarity). From this figure, an oversampling factor of about 
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Figure 11. Relative variance for i;r{s) (to be compared with the left panel of Fig.fjjl. Left: for N^y_i,= 27, we compare, for two sampling 
fractions (upper group of lines: 10 %, and lower group of lines: and 25 %), the relative variance for different number of bootstrap 
resamplings: Nr= 1, 2, and 3 N^^i, from top to bottom (green, blue and cyan lines respectively). The black and red lines have the same 
meaning as in Fig. [2] with the top (bottom) ones corresponding to a sampling fraction of 10 % (25 %). Right: same as left panel, but for 
N,„b= 125. 



three seems to be optimal, i.e. selecting at random, with re- 
placement, Nr=3 Nsub sub- volumes from the original list of 
length Ni,ub- 

We note that the left panels of Figs. [2]and llll in which 
the density of objects in the data set is varied, give extremely 
valuable information on how the relative errors scale for sam- 
ples with different mean densities, as together they span an 
order of magnitude variation in the mean density. Increas- 
ing the mean density by a factor 2.5 leads to a reduction in 
the relative variance error by ~ 80 % (top versus bottom 
black lines in Fig. Ilip . while a mean density 4 times larger 
decreases the relative error by a further 50 %, at which stage 
the change becomes more scale dependent. Indeed, the shape 
of the mean relative variance in the left panel of Fig. [2] is 
slightly different to those in Fig. 1111 We attribute this to the 
fact that there is a limiting number density beyond which 
the errors are not any longer dominated by sample shot- 
noise. 

The right panel of Fig. [TT] shows the same information 
as the left panel, but for a different number of sub-samples: 
125 instead of 27. The remarkable point to realize here is 
that both panels are virtually identical, leading to the con- 
clusion that the precise number of sub-samples the data is 
split into is not a primary factor for determining the size 
of bootstrap errors. This is in strong contrast to jackknife 
errors for which we observed, in Fig. [2] a systematic change 
in the relative variance as function of scale with respect to 
the mocks, with increasingly discrepant results on smaller 
scales for the jackknife analysis with increasing number of 
sub-samples. 

So on adopting this essentially new prescription for the 
bootstrap error analysis, i.e. allowing for bootstrap resam- 
plings of typically three times the number of sub-samples 



considered {Nr=3 Nsub), what is the impact on the other 
results presented in Sections 3] to [Sf This question is best 
answered in Figs. [T2] and 1131 which show the same infor- 
mation as Figs. [7] and [S] but for bootstrap estimates with 
iV,.=3 Nsub- 

Fig.[T2]is a great contrast to the messy Fig.[7l this time 
internal error estimates seem to return CI which are in much 
better agreement with the "truth" . We show, together with 
the mock CI in black, two sets of bootstrap estimates, each 
applied to one hundred different data sets; Boot-27 (in red) 
and Boot-125 (in green) with Nr=3 Nsub- We also note that 
the mock contour is on average ~ 50 % larger than in Fig. [T] 
This is due to the 25 % lower number density considered in 
this case, which is in good agreement with the findings of 
Fig. 111! for which we also found a typical ~ 50 % change in 
error w.r.t. the reference mean number density. 

Fig. [13] present the distributions of the relative area 
statistic for a suite of internal error estimators, all estimated 
over the range 1 to ~ 22 Mpc, with Npca = 6. In the 
left panel, we show the distributions of the standard boot- 
strap estimates with Nr—Nsub and those of the correspond- 
ing jackknife estimates: Boot-27 in red, Boot-125 in green, 
Jack-27 in blue, Jack-64 in cyan and Jack-125 in magenta. In 
the right panel, we show the distributions for bootstrap with 
Nr=3 Nsub- Boot-27 in red, Boot-64 in blue and Boot-125 
in green. 

Fig. ll3l makes several points. First it becomes clear from 
the right panel that the number of sub-samples drawn, with 
replacement, from a set of Nsub sub-samples has to be typ- 
ically three times larger than Nsub for bootstrap errors to 
best reproduce the "true" mock inferred errors with mini- 
mal bias. Secondly, once that regime is reached, bootstrap 
errors are only very marginally dependent on the number of 
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sub-samples the data set is split into, i.e. once the criteria on 
the number of data points discussed in Hartlap et al. (2007) 
is satisfied. For our clustering statistics, this seems to be 
true once A^sut is at least 5 times larger than Npca- Thirdly, 
the log-normal variance is limited to cr ~ 0.15. Hence in the 
absence of bias, the area of the CI are known to within a 
factor of 2 (with a 95 % confidence level), explicitly meaning 
that an estimated 2-cr CI can be mistaken to be pessimisti- 
cally a true ~ 1.3-(t or optimistically a true 2>-a. Finally 
the observed systematic bias seen in Fig. [13] is actually less 
important than it looks, as the uncertainty on the mock 
error is larger for the number density considered here. For 
/ = 0.25, we observe a ~ 20 % scatter on the mock error, 
which is more than twice as large compared to our findings 
with / = 1.0 (see 

Additionally, the left hand panel of Fig. 1131 once com- 
pared to Fig. [S] shows that the dispersion on the internal 
error estimates seem to be rather insensitive to the actual 
number density of the samples analysed. Decreasing the 
sampling rate down to 25 % of the original (as in Fig. I13p 
makes barely any difference to the quoted dispersions, as 
long as the distributions of relative areas are still well de- 
scribed by Gaussians. On the other hand, the bias seems to 
be a much stronger function of the assumed mean number 
density, with lower number densities tending to systemati- 
cally overestimate the errors by quite a significant amount. 
This is explicitly shown by the large mean values of each rel- 
ative area distribution shown in the left panel, ranging from 
~ 1.3 to as large as ~ 6.6 in the case of Jack-125. The bias 
increases by 1.5 to 2.5 times, depending on the estimator, 
when the mean density becomes 4 times smaller. When the 
bias becomes so large and the changes so unpredictable, it 
is no longer clear whether there is any positive side to be 
seen with these traditional internal estimators. 



6 SUMMARY AND CONCLUSIONS 

In this paper we have carried out an extensive comparison 
of the relative performance of internal and external error es- 
timators for two-point clustering statistics. We devise a set 
of numerical experiments, extracting independent data sets 
from N-body simulations. The data sets are chosen to have 
a volume comparable to that of the typical L* volume lim- 
ited samples constructed from the SDSS. The benchmark for 
this exercise is the error estimate obtained from the scatter 
over our independent data sets, which we refer to as "mock" 
errors or "truth". This is then compared to internal esti- 
mates made using the jackknife and bootstrap techniques 
(e.g. Tukey 1958; Efron 1979). We revisit the assumptions 
and the free parameters behind these techniques to see if we 
can lay down a recipe that would reproduce the more ex- 
pensive (and not even always possible) external errors. We 
summarize below our findings, starting with uncorrelated 
statistics, followed by covariance matrix based results and 
ending with the conclusions drawn from a case study, aimed 
at fitting two cosmological parameters to our clustering re- 
sults. 

Perhaps surprisingly (as they are so widely used in the 
literature), we find that both internal estimators of the er- 
rors have worrying failings. Neither was able to faithfully re- 
produce the relative variance of the external estimate, over 
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Figure 12. Same as Fig. [3 but for Nr=^ The "true" 

error estimate is given by the mock errors (thick black), while red 
(dotted) correspond to Boot-27 and green (dot-dashed) to Boot- 
125. This plot is a great contrast to the much messier Fig. [7] The 
fits are done over the projected scales indicated in the figure. See 
text for further comments. 



a limited range of scales (0.5 to 25 Mpc). At face value, 
the standard bootstrap variance is nearly 50 % larger on all 
scales, where standard refers to the case for which the num- 
ber of sub-samples selected at random, with replacement, 
equals the number of sub-samples the data set is divided 
into (i.e. Nr^Nsub)- In ^5.5\ we solve this problem of the 
overestimated variance simply by increasing Nr by a factor 
of three, i.e. by setting Nr=3 Nsut- The variance measured 
with the jackknife method is fairly accurate on large scales 
(greater than 10 Mpc), but the variance on small scales 
(less than 2-3 Mpc) is clearly overestimated by a sig- 
nificant amount and in a method dependent way: the bias 
depends strongly on the number of sub- volumes the data set 
is split into, as explicitly shown in Fig.jS] Another major re- 
sult from Section [4.31 is that all error estimators considered 
present Gaussianly distributed errors on the smallest pro- 
jected separations. However, slowly but surely, the error dis- 
tributions become non-Gaussian already on 6 (15) Mpc 
and larger for the jackknife (bootstrap) method, while they 
all remain Gaussian for the mocks over the full range of 
scales considered. 

The details of the recovered covariance matrices, or 
more precisely the recovered principal component decompo- 
sition of the covariance matrices for our 100 data sets, show 
equally worrying features in Sections 14.41 and 14.51 Generally 
speaking bootstrap inferred (normalized) eigenvalues and 
eigenvectors are in good agreement with mock inferred ones, 
while jackknife inferred (normalized) eigenvalues and eigen- 
vectors present distinctive features which are not present 
in the mocks and, furthermore, are also dependent on the 
number of sub-volumes the data set is split into. These fea- 
tures are particularly sensitive to the smallest scales used: 



SAGS-I: Robust error estimation for 2-point clustering statistics 21 



0.5 



0.4 



0.2 



0.4 0.6 0.1 



_ mock; n^,, 
boot: n.... 



0.2 - 



0.1 



=27 



JK: n,, 
JK: n. 



=64 
= 125 



-0.5 



2 4 

I I I I 

1.0 < r/h-'Mpc < 22.4 
/.t=3.52; 0=0.22; N,= l n,„, 
At=4.5S: CT=0.I4: N,= ] n„| 
/J,= \A3: <T = 0.16: N^=l n,„, 
/i=3.07; £7=0.17; N,= l n„, 
/i=6.64; a=0.2Z; N,= ] 





login A / 



0.5 



0.5 



0.4 



0.2 



0.4 0.6 0.8 



S'0,2 



0.1 



20.3 - 



mock: n, 
boot 
boot: n„ 
boot: n., 



pea 

n.„.=27 



=64 
= 125 



-0.5 



—I I I I I 

1.0 < r/h-'Mpc < 22.4 
M=0.66; CT=0.17; N^=3 n,„^ 
M=0.91; (7=0.14; N,=3n,„„ 
fj.=0.aO; £7=0.15; N^=3n,„^" 



0.5 



log,„ A / A„„, 



Figure 13. The distribution of the figure of merit, similar to Fig.O but for Nr=Ngj^i, (left panel) and Nr=3 Afgi,(,(right panel), assuming 
a mean number density of 25 % L* (as opposed to a L* mean number density as in Fig. [S]!. The figure legend gives the exact meaning 
of all the lines. The dotted lines are the corresponding Gaussian distributions with similar variance and mean. Increasing the number of 
sub-samples randomly drawn, with replacement, from the set of Ng^i, sub-samples radically changes the relative area distributions. 



as for the variance estimates, the jackknife method becomes 
increasingly discrepant w.r.t. the mocks when scales smaller 
than 2-3 h'^ Mpc are used. However, the direct influence of 
those subtle differences in a such technical analysis is hard 
to grasp and hence a better understanding in reached on an 
example scenario. 

In Section[5]we present a case study in which the differ- 
ent error estimation methods are used to extract constraints 
on cosmological parameters from measurements of the pro- 
jected correlation function, using 100 independent data sets. 
The discrepancy in the relative variance returned by the 
different methods, as described above, propagates through 
this analysis and gives different constraints on the fitted pa- 
rameters. We quantify these differences in terms of the area 
of the confidence interval which would correspond to 2-a 
for the case of a two parameter fit for Gaussian distributed 
measurements. With 100 independent data sets, we find that 
the internal estimators return, on average, error ellipses with 
larger area than that found in the mocks, resulting indirectly 
into a redefinition of the confidence limit of the measure- 
ment. This is particularly true for standard bootstrap esti- 
mates, for which the number of sub-samples selected at ran- 
dom with replacement equals the number of sub-samples the 
data set is divided into (i.e. Nr=Nsub)- However, we show 
in ijS.SI that increasing the number of sub-samples drawn at 
random by a factor of three (i.e. A'^r=3 Nsub) solves most 
problems: in that case, the confidence intervals are only 
marginally different to the mock ones. In the case of jack- 
knife errors, the area of the error ellipse is to some extent 
sensitive to the number of sub- volumes the data set is split 
into. For all error estimators, we find, as expected, that the 
error ellipse area is sensitive to the number of principal com- 
ponents used in the analysis. 

The diagnosis for the internal estimators is therefore 



mixed. The jackknife method has problems recovering the 
scale dependence of errors and the results are sensitive to 
the number of sub-samples the data set is split into. There 
is little scope to fix these problems given the definition of 
the method; the only thing we can vary is the number of 
sub-samples into which the data set is split. We did not 
find one choice for the number of sub-samples which could 
cure all of the ailments of this method. The prognosis for 
the bootstrap is on the other hand more encouraging. The 
problem of overestimating the variance can be traced to the 
effective volume of the data set used when the number of 
sub- volumes chosen at random with replacement is equal to 
the number of sub- volumes the data set is divided into. By 
oversampling the sub-volumes, this problem can be fixed, 
with the effective volume used tending to the original data 
set volume. Better still, for our application at least, there 
appears to be an optimal factor, three times, to oversample, 
with higher rates producing too little variance. 

Unfortunately there seems to be no hard and fast rules 
for the best way to set about a principal component analysis 
of the covariance matrix of clustering measurements. The 
value of the principal component analysis is that it helps 
break down the information contained in a clustering mea- 
surement. The measurement will be expressed in a restricted 
number of bins. The choice of the number of bins is arbitrary. 
Using more bins does not necessarily imply that there will 
be more information in the correlation function. The PGA 
breaks the covariance matrix down into eigenvectors. These 
are ranked in terms of how much information or variance 
they contain. The variance drops quickly with the order of 
the eigenvector for the examples we considered, indicating 
that most of the information in the clustering measurements 
can be broken down into a few terms. The best advice we 
can give here would be to compare the results obtained using 
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different numbers of eigenvectors and choose a value where 
the results and conclusions do not change significantly. 

The analysis presented in this paper is applicable to any 
galaxy or cluster survey with three dimensional information. 
Some of the issues discussed relating to redshift space dis- 
tortions are particular to local surveys in which the distant 
observer approximation does not hold. A new set of experi- 
ments would be required to extend our results to the calcula- 
tion of errors for photometric surveys, which look at angular 
clustering, or to multi-band photometric surveys, which will 
use photometric redshifts to look at clustering in redshift 
slices. The projection involved in these catalogues changes 
the underlying statistics, making them look more Gaussian 
perhaps. This is the most likely explanation why Cabre et al. 
(2007) find such a good agreement between jackknife and 
mock errors for angular clustering statistics. 
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